Abstract. In the evolution of a genome, the gene sequence is 
sometimes rearranged, for example by transposition of two adja- 
cent gene blocks. In biocombinatorics, one tries to reconstruct 
these rearrangement incidents from the resulting permutation. It 
seems that the algorithms used are too effective and find a shorter 
path than the real one. For the simplified case of adjacent transpo- 
sitions, we give expressions for the expected number of inversions 
after t random moves. This average can be much smaller than t, a 
fact that has largely been neglected so far. 



1. Background 

The genome rearrangement problem is a combinatorial problem arising
in the area of molecular evolution. Basically, it can be stated as a
problem about permutations of gene sequences. Given a permutation $\pi$
as a word in the symbols $\{1, \dots, n\}$ (corresponding to genes), find
the "best" path to the identity permutation when the feasible steps are
block moves (removing a contiguous segment and inserting it somewhere
else) and block reversals (reversing the order of a segment). The
shortest path between the two permutations is the parsimonious
solution, and finding algorithms for computing the shortest (or at least
a short) path has been given a good deal of attention in [3], [2] etc.
Although no particular solution is more probable than the parsimonious
one, the shortest distance is not necessarily the most probable
length of a path. The number of shortest paths from the identity to $\pi$
is much less than the number of paths using just a few extra steps. If a
probabilistic model of the process is formulated, a maximum-likelihood
distance could be defined. We have not seen this problem considered
in the literature.

One approach is to try to determine the expected distance to the 
identity permutation after a random walk of given length. This seems 
to be a difficult problem. Of course, one can obtain intuition from com- 
puter simulation, but a mathematical treatment would be preferable. 

In the present paper, we simplify the model so that the only steps 
allowed are adjacent transpositions. Observe that the set of adjacent 
transpositions is the intersection of the set of block moves and the set 
of block reversals. For this simpler problem, we are able to obtain good 
lower and upper bounds by modelling the random walks in terms of a 
discrete heat equation. 



2. Introduction

We are studying random walks of fixed length $t$ on the Cayley graph
of the Coxeter group $A_n$, which is isomorphic to the symmetric group
$S_{n+1}$. The Cayley graph has an edge between two permutations if one
is obtained from the other by an adjacent transposition. (See [1] for
similar problems on $S_{n+1}$ and other groups.)

The walk starts at the identity permutation $1234\ldots$ and consists of
$t$ random steps, chosen with uniform probability among the $n$ possible
adjacent transpositions. Let $\pi$ be the permutation where this random
walk stops and let $\mathrm{inv}(\pi)$ denote the number of inversions. The
shortest possible walk from the identity to $\pi$ has length
$\mathrm{inv}(\pi)$, so this number is less than or equal to $t$. Clearly,
for $t = 1$, all permutations have $\mathrm{inv}(\pi) = 1$, but for
$t \ge 2$, a later move may cancel an inversion created by an earlier
move. We would like to determine the expected number of inversions
$E(\mathrm{inv}(\pi))$, and we will denote it by $E_{nt}$ to make the
dependence on the parameters $n$ and $t$ explicit.

The set of adjacent transpositions is denoted by $S = \{s_1, \dots, s_n\}$
where $s_i$ is the transposition of the positions $i$ and $i + 1$. Let
$V_{nt}$ be the set of all walks of length $t$, that is, of all words in
$S$ of length $t$:

\[ V_{nt} = \{ s_{i_1} s_{i_2} \cdots s_{i_t} : 1 \le i_1, \dots, i_t \le n \}. \]

Obviously $V_{nt}$ has cardinality $n^t$. As the same notation is used for a
word and its product $\pi = s_{i_1} s_{i_2} \cdots s_{i_t}$, the notation
$V_{nt}$ will be used also for the multiset of permutations. By counting
all inversions in $V_{nt}$ we can find the average number:

\[ n^t E_{nt} = \sum_{\pi \in V_{nt}} \mathrm{inv}(\pi). \]
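
The counting of inversions over all words can be checked by direct
enumeration for small parameters. The following Python sketch is not
part of the original text; the function names `inversions`,
`total_inversions` and `expected_inversions` are our own, and since the
loop visits all $n^t$ words it is only practical for small $n$ and $t$.

```python
from fractions import Fraction
from itertools import product


def inversions(perm):
    """Count pairs of positions (a, b) with a < b and perm[a] > perm[b]."""
    return sum(
        1
        for a in range(len(perm))
        for b in range(a + 1, len(perm))
        if perm[a] > perm[b]
    )


def total_inversions(n, t):
    """Return n^t * E_nt: inversions summed over all n^t words of length t."""
    total = 0
    for word in product(range(n), repeat=t):
        perm = list(range(n + 1))  # identity permutation in S_{n+1}
        for i in word:             # apply the adjacent transposition at i, i+1
            perm[i], perm[i + 1] = perm[i + 1], perm[i]
        total += inversions(perm)
    return total


def expected_inversions(n, t):
    """Exact value of E_nt as a fraction."""
    return Fraction(total_inversions(n, t), n ** t)
```

For instance, `total_inversions(4, 2)` gives 24, so that $E_{42} = 24/16 = 3/2$.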
Using a computer, we have calculated the integers $n^t E_{nt}$ for
$n, t \le 10$. These data suggested a formula of unexpected simplicity.
Let $C_r$ denote the Catalan number $\frac{1}{r+1}\binom{2r}{r}$.



Theorem 2.1. For a fixed $t$ and for all $n \ge t$, the expected number of
inversions after $t$ random adjacent transpositions is

\[ E_{nt} = t - \frac{2}{n}\binom{t}{2}
   + \sum_{r=2}^{t} \frac{(-1)^r}{n^r}
     \left[ 2^r C_r \binom{t}{r+1} + 4 d_r \binom{t}{r} \right], \]

where $d_2, d_3, d_4, \dots$ is a certain integer sequence starting by
0, 1, 9, 69, 510. No expression for $d_r$ is known, but the following
inequalities hold.



A direct proof of the theorem seems difficult, so our approach has 
been a reformulation of the problem to a discrete heat flow model. 
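
For small parameters the closed form of Theorem 2.1 can be evaluated
exactly. The sketch below is not from the paper. It assumes that the
formula reads $E_{nt} = t - \frac{2}{n}\binom{t}{2} + \sum_{r=2}^{t}
\frac{(-1)^r}{n^r}\bigl[2^r C_r \binom{t}{r+1} + 4 d_r \binom{t}{r}\bigr]$,
and it uses only the five quoted values $d_2, \dots, d_6$, so it covers
$t \le 6$. The name `theorem_expected` is hypothetical.

```python
from fractions import Fraction
from math import comb

# The five initial values d_2, ..., d_6 quoted in Theorem 2.1.
D = {2: 0, 3: 1, 4: 9, 5: 69, 6: 510}


def catalan(r):
    """Catalan number C_r = binom(2r, r) / (r + 1)."""
    return comb(2 * r, r) // (r + 1)


def theorem_expected(n, t):
    """E_nt from the assumed closed form, for n >= max(t, 1) and t <= 6."""
    if n < 1 or n < t or t > max(D):
        raise ValueError("only evaluated for n >= max(t, 1) and t <= 6")
    value = Fraction(t) - Fraction(2 * comb(t, 2), n)
    for r in range(2, t + 1):
        # math.comb returns 0 when the lower index exceeds the upper one.
        bracket = 2 ** r * catalan(r) * comb(t, r + 1) + 4 * D[r] * comb(t, r)
        value += Fraction((-1) ** r * bracket, n ** r)
    return value
```

With $n = t = 3$ this evaluates to $47/27$, which agrees with a hand
count over the 27 words of length 3 in $s_1, s_2, s_3$ (ten reduced
words with three inversions and seventeen words with one inversion).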

3. The heat flow analogy 

Instead of directly counting all inversions in $V_{nt}$, we introduce the
following fine-grading. Fixing $n$ and $t$, let

\[ p_{ij} := \mathrm{Prob}(\pi_i < \pi_j) \]

for a random permutation $\pi \in V_{nt}$. Equivalently,

\[ n^t p_{ij} = \#\{ \pi \in V_{nt} : \pi_i < \pi_j \}. \]

Since every inversion is counted by one such class, we have

\[ E_{nt} = \sum_{i>j} p_{ij}. \tag{1} \]

The matrices $(p_{ij})$ can be computed recursively. For $t = 0$, the set
$V_{nt}$ consists of the identity permutation only, so $(p_{ij})$ has ones
above the main diagonal and zeroes below, as in the leftmost matrix of
Figure 1. The transformation to $t = 1$, $t = 2$ and so on turns out to be
a heat flow process. The total heat is invariant, for $p_{ij} + p_{ji} = 1$
by the law of the Excluded Middle. The main diagonal in the matrix may be
left blank, as in the figure, or we may set all its entries to 1/2 so that
the rule $p_{ij} + p_{ji} = 1$ is satisfied.

For any graph with real numbers (signifying temperature or heat) on
the vertices, a heat flow process with thermal conductivity $x$ means the
following. In each step, every vertex sends the fraction $x$ of its heat to
each of its neighbours, at the same time receiving that same fraction
of its neighbours' heat. Two steps are shown in Fig. 1.

HENRIK ERIKSSON, KIMMO ERIKSSON AND JONAS SJOSTRAND

Figure 1. Two heat flow steps

Proposition 3.1. The sequence of $(p_{ij})$-matrices for $t = 0, 1, 2, \dots$
describes a heat flow process with conductivity $x = 1/n$ on the graph
depicted in Fig. 1. The expected number of inversions, $E_{nt}$, equals the
total heat below the diagonal.

Proof. Consider the $(p_{ij})$-matrix after $t$ steps and let $\pi$ be one of
the permutations contributing to $p_{ij}$, that is, $\pi$ satisfies
$\pi_i < \pi_j$. Each neighbour of $p_{ij}$ corresponds to a move that
affects either $\pi_i$ or $\pi_j$. For example, the neighbour to the right,
$p_{i,j+1}$, corresponds to the transposition $(\pi_j, \pi_{j+1})$. When $i$
and $j$ are adjacent, the transposition $(\pi_i, \pi_j)$ is possible, which
explains the graph edges across the main diagonal.

Except for these moves (at most four), the new $p'_{ij}$ would be the
same as the old $p_{ij}$, but now the following is true:

\[ p'_{ij} = p_{ij} + \frac{1}{n} \sum (p_{\text{neighbour}} - p_{ij}), \tag{2} \]

where the sum is taken over all graph neighbours of $p_{ij}$. For after,
say, the transposition $(\pi_j, \pi_{j+1})$, the $p'_{ij}$-condition
$\pi'_i < \pi'_j$ means that we must have had $\pi_i < \pi_{j+1}$, which is
the $p_{i,j+1}$-condition. □

From now on our analysis concerns the more general heat flow process
where $x$ is not necessarily $1/n$. For the matrix entries we write
$p_{ij}(x)$ and for the total heat below the diagonal we use the notation
$E_{nt}(x)$. For example, Fig. 1 demonstrates that $E_{41}(x) = 4x$ and
$E_{42}(x) = 8x - 8x^2$.

This analysis is complicated by the special edges across the diagonal.
However, if we replace the graph of Fig. 1 by the simple grid graph of
Fig. 2 and set all diagonal values to 1/2, then the heat flow process is
unchanged! For thanks to the symmetry property $p_{j,j+1} = 1 - p_{j+1,j}$
we have

\[ p_{j,j+1} - p_{j+1,j} = 1 - 2p_{j+1,j} = 2(1/2 - p_{j+1,j}). \]

In other words, the loss of the neighbour across the diagonal is
compensated for by the two new neighbours on the diagonal.

1/2 ---  1  ---  1  ---  1  ---  1
 |       |       |       |       |
 0  --- 1/2 ---  1  ---  1  ---  1
 |       |       |       |       |
 0  ---  0  --- 1/2 ---  1  ---  1
 |       |       |       |       |
 0  ---  0  ---  0  --- 1/2 ---  1
 |       |       |       |       |
 0  ---  0  ---  0  ---  0  --- 1/2

Figure 2. Grid graph with initial values

Proposition 3.2. The sequence of $p_{ij}(x)$-matrices for $t = 0, 1, 2, \dots$
describes a heat flow process on the $(n+1) \times (n+1)$ grid graph
depicted in Fig. 2.
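
The grid model can be simulated directly. The sketch below is not from
the paper; it stores each matrix entry as a list of polynomial
coefficients in $x$ and applies the update "entry plus $x$ times the sum
of (neighbour minus entry)" on the $(n+1) \times (n+1)$ grid with
insulated outer boundary and initial values 1 above the diagonal, 1/2 on
it and 0 below. The function name `heat_flow_total` is our own.

```python
from fractions import Fraction


def heat_flow_total(n, t):
    """Coefficient list [c0, c1, ...] of E_nt(x), the heat below the diagonal."""
    size = n + 1
    half = Fraction(1, 2)

    def const(c):
        # A constant polynomial, stored with room for degree t.
        return [Fraction(c)] + [Fraction(0)] * t

    grid = [
        [const(1) if i < j else const(half) if i == j else const(0)
         for j in range(size)]
        for i in range(size)
    ]

    for _ in range(t):
        new = [[None] * size for _ in range(size)]
        for i in range(size):
            for j in range(size):
                p = grid[i][j]
                # Sum of (neighbour - p) over the grid neighbours that exist.
                diff = [Fraction(0)] * (t + 1)
                for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= a < size and 0 <= b < size:
                        q = grid[a][b]
                        diff = [d + qc - pc for d, qc, pc in zip(diff, q, p)]
                # Multiplying by x shifts every coefficient up by one degree;
                # the dropped top coefficient is zero because degrees grow by
                # at most one per step.
                shifted = [Fraction(0)] + diff[:-1]
                new[i][j] = [pc + sc for pc, sc in zip(p, shifted)]
        grid = new

    total = [Fraction(0)] * (t + 1)
    for i in range(size):
        for j in range(i):  # entries strictly below the diagonal
            total = [s + c for s, c in zip(total, grid[i][j])]
    return total
```

For $n = 4$ the simulation returns the coefficients of $4x$ after one
step and of $8x - 8x^2$ after two steps, the two values quoted from
Fig. 1.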

3.1. Hot boundary condition. The above heat flow process on a
grid with insulated boundary can be reformulated as a heat flow process
on the lower triangle with the hot boundary condition $p_{ii} = 1/2$ on the
diagonal. This should be obvious from the fact that the only connection
between the lower and upper triangle of the grid graph is the diagonal,
and the property of the original process that the diagonal has constant
temperature $p_{ii} = 1/2$.

Note that the subdiagonal element $p_{j+1,j}(x)$ receives $2x/2$ from its
diagonal neighbours and sends back $2x\,p_{j+1,j}(x)$. The net heat transfer
to the lower triangle is $\sum_j (x - 2x\,p_{j+1,j}(x))$, so we have the
following result.

Proposition 3.3.

\[ E_{n,t+1}(x) = E_{nt}(x) + nx - 2x \sum_j p_{j+1,j}(x). \]

For example, in Fig. 1 we see that $8x - 8x^2 = 4x + 4x - 2x(4x)$.
3.2. Symmetric model and semi-infinite model. The insulated 
left and lower boundaries can be gotten rid of in two different ways, by 
symmetric extension or by just neglecting their effect. 

The reflection trick in Fig. 3 demonstrates that the diamond graph 
with hot boundaries all around is equivalent to the triangle. This model 
is what we will use for our exact solution later in this paper. 

Neglecting the effect of the left and lower boundaries is equivalent 
to moving them to infinity. Then we are left with the whole half-plane 
below an infinite hot diagonal. As we will see, this problem is not so 
difficult and we can use its solution as a lower bound. 

It is clear that the temperature at a given inner point in the diamond 
model must be at least as hot as a point at the same distance from the 
diagonal in the semi-infinite model, since the former point has heat 
flowing to it from three additional sides. Hence, solving the semi- 
infinite model gives a lower bound for the actual finite case. 




Figure 3. Triangle with hot diagonal and symmetric 
extension of the same problem 

4. Heat flow combinatorics 

Our goal in this section is to find combinatorial expressions for the
$p_{ij}(x)$-matrices describing heat flow on a triangular grid graph. As
explained above, the triangular graph can be considered as embedded
either in a finite diamond graph or in a semi-infinite grid graph.

At day $t = 0$ the diagonal entries are $p_{ii} = 1/2$ with zeroes below the
diagonal. Referring to Fig. 1 and the recursion ($p'$ means next day)

\[ p'_{ij}(x) = p_{ij}(x) + x \sum \bigl(p_{\text{neighbour}}(x) - p_{ij}(x)\bigr), \tag{3} \]

it is obvious that the entries $p_{ij}(x)$ in step $t$ will be polynomials
in $x$ of degree $t$ (or less). We can give each coefficient in these
polynomials a combinatorial significance. Vaguely expressed, they count
journeys for $t$ days from the hot boundary to the location of $p_{ij}(x)$.
The recursion states that such a journey ending in a certain location on
a certain day [$p'_{ij}(x)$] may have been either at the same location
yesterday [$p_{ij}(x)$] and had a resting day, or at a neighbouring
location [$p_{\text{neighbour}}(x)$] and had a travel day (the $x$-factor
means travel day), or at the same location and travelled half-way to the
neighbour and then back (the $-x$-factor means round trip).

o— •— o— •— o— •— o 

I I I I 

• • • • 

I I I I 

o— •— o— •— o— •— o 

I I I I 

• • • • 

I I I I 

o— •— o— •— o— •— o 

I I I I 

• • • • 

I I I I 

o— •— o— •— o— •— o 

Figure 4. Grid graph with mid- vertices 

In order to make our statements precise, we will modify the grid graph
as in Fig. 4. Each of the original edges is split in two by a new
mid-vertex. Mid-vertices are introduced for counting purposes only; they
do not carry heat.

Each day, the vertices on the hot boundary send out heat packets
with the value 1/2 to their neighbours. These packets are sent on and
on, back and forth, always multiplied by $x$ or $-x$. Consider one such
heat packet at a certain location in the morning of a certain day. What
can happen to it during the day?

(1) It stays on its vertex unchanged. 

(2) It travels a half-edge, gets multiplied by x, and travels the other 
half-edge to the next vertex. 

(3) It travels a half-edge, gets multiplied by $-x$, and returns the
same half-edge to the same vertex.

If the start value is 1/2 and the journey has $r$ travel days (type 2 or 3),
the final value is $\pm\frac{1}{2}x^r$. The sign depends on the number of
days of type 3, and is easily seen to be $(-1)^{r+i-j}$ if the journey ends
at $(i,j)$. Hence we have the following result.

Lemma 4.1. The coefficient of the $x^r$-term in $p_{ij}(x)$ is
$\frac{1}{2}(-1)^{r+i-j}$ times the number of journeys from the hot
boundary to $(i,j)$ in $t$ days, $r$ of which are travel days.

4.1. The semi-infinite model. Our next step is counting the journeys
specified in the lemma. This is easy in the semi-infinite case where
all points on the $k$-subdiagonal are equivalent. We define the sublevel
of $(i,j)$ as $k = i - j$. The journey from sublevel $k$ to sublevel 0 can
be specified by three items:

• Out of the $t$ days, $r$ travel days must be chosen. This can be
done in $\binom{t}{r}$ ways.

• For each of the $r$ travel days, horizontal or vertical travel must
be chosen. This can be done in $2^r$ ways.

• A Catalan walk in $2r$ half-steps from sublevel $k$ to sublevel 0
must be specified. By Catalan walk we mean that sublevel 0
must not be reached until the last half-step. It is well-known
that the number of such walks is

\[ \binom{2r-1}{r-k} - \binom{2r-1}{r-k-1}. \]

For $k = 1$ this is the Catalan number $C_r$.
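
The Catalan walk count can be confirmed by a small dynamic program. The
sketch below is not from the paper; it measures height in half-steps, so
sublevel $k$ corresponds to height $2k$, and it counts walks of $2r$ unit
steps that touch height 0 for the first time on the final step. The
names `first_passage_walks` and `closed_form` are our own.

```python
from math import comb


def first_passage_walks(r, k):
    """Count walks of 2r half-steps from height 2k that first reach 0 at the end."""
    # ways[h] = number of partial walks now at height h that never touched 0.
    ways = {2 * k: 1}
    for step in range(2 * r):
        nxt = {}
        last = step == 2 * r - 1
        for h, w in ways.items():
            for nh in (h - 1, h + 1):
                # Height 0 is only allowed on the final half-step.
                if nh > 0 or (nh == 0 and last):
                    nxt[nh] = nxt.get(nh, 0) + w
        ways = nxt
    return ways.get(0, 0)


def closed_form(r, k):
    """binom(2r-1, r-k) - binom(2r-1, r-k-1), out-of-range binomials read as 0."""
    def c(n, m):
        return comb(n, m) if 0 <= m <= n else 0
    return c(2 * r - 1, r - k) - c(2 * r - 1, r - k - 1)
```

For $k = 1$ and $r = 1, 2, 3$ both functions return 1, 2, 5, the first
Catalan numbers.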

Combining this journey count with Lemma 4.1, we get the expressions
for $p_{ij}(x)$ and eventually the total heat in the triangle.

Proposition 4.2. After $t$ time steps in the semi-infinite model,

\[ p_{ij}(x) = \frac{1}{2} \sum_{r=k}^{t} (-1)^{r+k} \binom{t}{r} 2^r
   \left[ \binom{2r-1}{r-k} - \binom{2r-1}{r-k-1} \right] x^r, \tag{4} \]

where $k = i - j$. The total heat under the diagonal is

\[ E_{nt}(x) = ntx - \frac{n+1}{2} \sum_{r=2}^{t} (-1)^r \binom{t}{r}
   2^r C_{r-1} x^r. \tag{5} \]



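Proof. As there are $n + 1 - k$ locations on sublevel $k$, we must do the
following sum.

\[ E_{nt}(x) = \sum_{k=1}^{n} (n+1-k)\, \frac{1}{2} \sum_{r=k}^{t}
   (-1)^{r+k} \binom{t}{r} 2^r
   \left[ \binom{2r-1}{r-k} - \binom{2r-1}{r-k-1} \right] x^r \]

\[ \phantom{E_{nt}(x)} = \frac{1}{2} \sum_{r=1}^{t} (-1)^r \binom{t}{r} 2^r x^r
   \sum_{k=1}^{\min(r,n)} (-1)^k (n+1-k)
   \left[ \binom{2r-1}{r-k} - \binom{2r-1}{r-k-1} \right], \]

and the last sum simplifies to give the desired result. □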
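
The simplification in the proof can be checked numerically. The sketch
below is not from the paper; it assumes that Eq. 4 carries the factors
$\frac{1}{2}(-1)^{r+k}\binom{t}{r}2^r$ in front of the Catalan walk count
and that Eq. 5 reads $ntx - \frac{n+1}{2}\sum_{r \ge 2} (-1)^r \binom{t}{r}
2^r C_{r-1} x^r$. It then compares the weighted sum over sublevels with
the closed form, coefficient by coefficient. All function names are
hypothetical.

```python
from fractions import Fraction
from math import comb


def catalan(r):
    """Catalan number C_r = binom(2r, r) / (r + 1)."""
    return comb(2 * r, r) // (r + 1)


def walks(r, k):
    """Catalan walk count binom(2r-1, r-k) - binom(2r-1, r-k-1)."""
    def c(n, m):
        return comb(n, m) if 0 <= m <= n else 0
    return c(2 * r - 1, r - k) - c(2 * r - 1, r - k - 1)


def p_semi_infinite(k, t):
    """Coefficient list of p_ij(x) on sublevel k = i - j after t steps (Eq. 4)."""
    coeffs = [Fraction(0)] * (t + 1)
    for r in range(k, t + 1):
        coeffs[r] = Fraction(
            (-1) ** (r + k) * comb(t, r) * 2 ** r * walks(r, k), 2
        )
    return coeffs


def total_by_summation(n, t):
    """Sum of (n + 1 - k) * p_k(x) over the sublevels k = 1, ..., n."""
    total = [Fraction(0)] * (t + 1)
    for k in range(1, n + 1):
        pk = p_semi_infinite(k, t)
        total = [s + (n + 1 - k) * c for s, c in zip(total, pk)]
    return total


def total_closed_form(n, t):
    """Coefficient list of the assumed closed form (Eq. 5)."""
    coeffs = [Fraction(0)] * (t + 1)
    if t >= 1:
        coeffs[1] = Fraction(n * t)
    for r in range(2, t + 1):
        coeffs[r] = -Fraction(
            (n + 1) * (-1) ** r * comb(t, r) * 2 ** r * catalan(r - 1), 2
        )
    return coeffs
```

For $n = 4$, $t = 2$ both routes give $8x - 10x^2$, which lies below the
finite-case value $8x - 8x^2$ for positive $x$.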



As we observed in Sec. 3.2, the total heat in the semi-infinite model
gives a lower bound for the total heat in the finite case. In particular,
we can plug in $x = 1/n$ to obtain a lower bound for the expected number
of inversions.

Corollary 4.3. The lower bound for $E_{nt}$ in Theorem 2.1 holds true.

Proof. Substitute $x = 1/n$ and collect like powers of $1/n$. □

Remarkably, the semi-infinite model also provides an upper bound for
the total heat in the finite case. By iterating the recursion of
Proposition 3.3 for the $E_{nt}$ in the finite case, we obtain the
following formula.

Lemma 4.4. Let $e_t(x)$ denote the sum of the subdiagonal entries of
the $p_{ij}(x)$-matrix for time step $t$ in the finite case. Then

\[ E_{nt}(x) = ntx - 2x\,[e_{t-1}(x) + e_{t-2}(x) + \cdots + e_1(x)]. \]

We know that all $p_{ij}(x)$ in the semi-infinite model are less than or
equal to the $p_{ij}(x)$ in the finite case. In particular, the
subdiagonal sums are no larger, so if we use them for the $e_s(x)$ in the
lemma above, we obtain an upper bound for $E_{nt}$.

Corollary 4.5. The upper bound for $E_{nt}$ in Theorem 2.1 holds true.

Proof. Use the lemma together with Eq. 4, then substitute $x = 1/n$ and
simplify. □
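
Both bounds can be compared with exact values for small parameters. The
sketch below is not from the paper, and the displayed inequalities of
Theorem 2.1 are not reproduced here. It assumes the lower bound is Eq. 5
at $x = 1/n$ and the upper bound is Lemma 4.4 with every $e_s(x)$ replaced
by $n$ times the semi-infinite value on sublevel 1, namely
$\frac{1}{2}\sum_r (-1)^{r+1}\binom{s}{r} 2^r C_r x^r$. The function names
are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def catalan(r):
    """Catalan number C_r = binom(2r, r) / (r + 1)."""
    return comb(2 * r, r) // (r + 1)


def brute_force(n, t):
    """Exact E_nt by enumerating all n^t words of adjacent transpositions."""
    total = 0
    for word in product(range(n), repeat=t):
        perm = list(range(n + 1))
        for i in word:
            perm[i], perm[i + 1] = perm[i + 1], perm[i]
        total += sum(
            perm[a] > perm[b]
            for a in range(n + 1)
            for b in range(a + 1, n + 1)
        )
    return Fraction(total, n ** t)


def lower_bound(n, t):
    """Assumed Eq. 5 at x = 1/n: total heat in the semi-infinite model."""
    x = Fraction(1, n)
    tail = sum(
        (-1) ** r * comb(t, r) * 2 ** r * catalan(r - 1) * x ** r
        for r in range(2, t + 1)
    )
    return n * t * x - Fraction(n + 1, 2) * tail


def subdiagonal_semi_infinite(s, x):
    """Semi-infinite value on sublevel 1 after s steps (Eq. 4 with k = 1)."""
    return sum(
        Fraction((-1) ** (r + 1) * comb(s, r) * 2 ** r * catalan(r), 2) * x ** r
        for r in range(1, s + 1)
    )


def upper_bound(n, t):
    """Lemma 4.4 with each e_s replaced by n times the semi-infinite value."""
    x = Fraction(1, n)
    e_sum = sum(n * subdiagonal_semi_infinite(s, x) for s in range(1, t))
    return n * t * x - 2 * x * e_sum
```

For $n = t = 3$ this gives $41/27 \le 47/27 \le 51/27$, with the exact
value in the middle.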

4.2. The finite case. Lemma 4.1 tells us that $p_{ij}(x)$ is an
alternating polynomial and that its coefficients count journeys from the
hot boundary to $(i,j)$. In the finite case, counting journeys is
difficult when $n < t$, for there are four boundaries and wherever you
start it is possible to reach more than one of them in $t$ days. But when
$n \ge t$, the situation is better.

The expression for the number of journeys with $r$ travel days starting
at $(i,j)$ and ending at a hot boundary used to be

\[ \binom{t}{r} 2^r \left[ \binom{2r-1}{r-k} - \binom{2r-1}{r-k-1} \right], \]

but for some $(i,j)$ that are close to two boundaries, this number will
now increase. For the extra journeys, horizontal and vertical steps
cannot be chosen freely, so the factor $2^r$ does not apply. For example,
from (2, 1) it is possible to reach the left boundary in two steps, at
least one of which must be horizontal. Therefore, the contribution of
the extra journeys to $E_{nt}(x)$ will be of the form

\[ 2 g_r x^r. \tag{6} \]

The factor 2 comes from symmetry. More important than the exact
value of the $g_r$-numbers is the fact that they do not depend on $n$.
Therefore, these correction terms become less and less important as $n$
increases. For the expected number of inversions, $E_{nt}$, the correction
terms may be written as in Theorem 2.1.

5. Open problems 

(1) Is there a nice expression for the $d_r$-numbers of Theorem 2.1?

(2) Is there a nice expression for the $g_r$-numbers of Eq. 6?

(3) Can the analysis be extended to adjacent block transpositions?

(4) Can the analysis be extended to block reversals? 

(5) If the result of some random moves is a permutation with a 
certain number of inversions, what number of moves is the most 
probable? 